Hell’s Kitchen Contestants Analysis

Ever since watching the first few seasons of Hell’s Kitchen, I’ve wondered about the demographic breakdown of the contestants and their final rankings. However, I haven’t actually watched all the seasons, so to avoid spoilers I decided to web scrape the Hell’s Kitchen wiki and analyze de-identified data to look for trends.



The Hell’s Kitchen wiki has data on all seasons and contestants. It also has helpful tables on the contestants that appear to be standardized across pages, and thus easy to scrape. Python code was used in this R Markdown file since Python has more package options for web scraping. The code below shows how the data was pulled.

Here are the libraries I used.


import requests
import pandas as pd
import numpy as np
import plotly.express as px
from bs4 import BeautifulSoup
import re

I started by pulling episode information from the Wiki’s episode list.


url = 'https://hellskitchen.fandom.com/wiki/List_of_Hell%27s_Kitchen_Episodes'
html = requests.get(url).content
df_list = pd.read_html(html)

Every season gets its own table as shown below.


df_list[0]
##       Nº      #                     Title        Air Date
## 0     01     01       Episode 101 - Day 1    May 30, 2005
## 1     02     02       Episode 102 - Day 2    June 6, 2005
## 2     03     03       Episode 103 - Day 3   June 13, 2005
## 3     04     04       Episode 104 - Day 4   June 20, 2005
## 4     05     05       Episode 105 - Day 5   June 27, 2005
## 5     06     06       Episode 106 - Day 6   July 11, 2005
## 6     07     07       Episode 107 - Day 7   July 11, 2005
## 7     08     08       Episode 108 - Day 8   July 18, 2005
## 8     09     09       Episode 109 - Day 9   July 25, 2005
## 9  10/11  10/11  Episode 110/111 - Day 10  August 1, 2005

The current number of seasons can then be calculated by subtracting 1 from the length of the list (an extra, unrelated table is being pulled along with the seasons).


maxseason = len(df_list) - 1
maxseason
## 18

The above shows that there are currently 18 seasons. I can start by pulling just the first season’s information.


seasonurl = 'https://hellskitchen.fandom.com/wiki/Season_'
seasonstart = 1
seasonurl1 = seasonurl + str(seasonstart)
htmls1 = requests.get(seasonurl1).content
lists1 = pd.read_html(htmls1)
lists1[0]
##    Hell's Kitchen Seasons  ...  Hell's Kitchen Seasons.19
## 0                       1  ...                         20
## 
## [1 rows x 20 columns]

Since there are only 18 seasons we could write these lines manually for each season, but the simple loop below pulls the rest of the seasons’ information and appends it onto season 1. I start by renaming some things.


seasondf = lists1[0]
seasondf['Season'] = 1
seasondfs = seasondf

Then we can create a loop that will append the rest of the seasons to that frame.


for x in range(seasonstart + 1, maxseason + 1):
    sur = seasonurl + str(x)
    htmls = requests.get(sur).content
    lists = pd.read_html(htmls)
    seasondf = lists[0]
    seasondf['Season'] = x
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    seasondfs = pd.concat([seasondfs, seasondf], ignore_index=True)
    

The table already has some interesting information about the contestants such as age, occupation, and hometown. But we can visit each contestant’s personal wiki page to get even more. The name of each contestant in the table links to a page whose URL replaces the spaces in their name with underscores, so contestant Gabriel “Gabe” Gagliardi would have the URL: https://hellskitchen.fandom.com/wiki/Gabriel_"Gabe"_Gagliardi. Let’s pull this contestant’s page to see what tables we can scrape.


contestanturl = 'https://hellskitchen.fandom.com/wiki/Gabriel_"Gabe"_Gagliardi'
contestant1 = requests.get(contestanturl).content
contestant1tables = pd.read_html(contestant1)
contestant1tables
## [Empty DataFrame
## Columns: [(Week 1, Week 2), (Win, Nominated)]
## Index: [],                              Hell's Kitchen Season 2
##                                                Staff
## 0  Gordon Ramsay (Head Chef) • Scott Leibfried (B...
## 1                                        Contestants
## 2  Giacomo Alfieri • Rachel Brown • Virginia Dalb...]

It doesn’t look like the table we want is being pulled, which means we have to dig deeper into the webpage.


soup = BeautifulSoup(contestant1, features="lxml")

The object soup now contains the raw text of the webpage. The variable titles live in h3 tags with the class pi-data-label pi-secondary-font, which match up with the answers in div tags with the class pi-data-value pi-font. Since the data is not in a well-defined table on the page, it takes some more effort to extract.


soup_titles = soup.find_all('h3', {'class' :'pi-data-label pi-secondary-font'})
soup_titles
## [<h3 class="pi-data-label pi-secondary-font">Hometown</h3>, <h3 class="pi-data-label pi-secondary-font">Age</h3>, <h3 class="pi-data-label pi-secondary-font">Occupation</h3>, <h3 class="pi-data-label pi-secondary-font">Challenges Won</h3>, <h3 class="pi-data-label pi-secondary-font">Services Won</h3>, <h3 class="pi-data-label pi-secondary-font">Times as BoW/Announcer</h3>, <h3 class="pi-data-label pi-secondary-font">Times Nominated</h3>, <h3 class="pi-data-label pi-secondary-font">Placement</h3>, <h3 class="pi-data-label pi-secondary-font">Episode Eliminated</h3>]
soup_answers = soup.find_all('div', {'class' :'pi-data-value pi-font'})
soup_answers
## [<div class="pi-data-value pi-font">Chicago, IL</div>, <div class="pi-data-value pi-font">27</div>, <div class="pi-data-value pi-font">Marketing Executive</div>, <div class="pi-data-value pi-font">0</div>, <div class="pi-data-value pi-font">1</div>, <div class="pi-data-value pi-font">0</div>, <div class="pi-data-value pi-font">1</div>, <div class="pi-data-value pi-font">10th</div>, <div class="pi-data-value pi-font"><a href="/wiki/Episode_202_-_11_Chefs" title="Episode 202 - 11 Chefs">11 Chefs</a></div>]

We can now see a lot of useful information we would want to gather for the contestants, such as challenges won, times nominated for elimination, and, most importantly, their placement. We can pull each title and answer with the code below.


soup_titles[0]
## <h3 class="pi-data-label pi-secondary-font">Hometown</h3>
soup_answers[0]
## <div class="pi-data-value pi-font">Chicago, IL</div>

We don’t need most of that text, so we can use regular expressions to pull out just the values.


title0 = str(soup_titles[0])
answer0 = str(soup_answers[0])

titleout = re.search('>(.*)<', title0)
titleout.group(1)
## 'Hometown'
answerout = re.search('>(.*)<', answer0)
answerout.group(1)
## 'Chicago, IL'

Now we can loop through all of the titles and build a new table with three columns: name, title, and answer.


contestanturlb = 'https://hellskitchen.fandom.com/wiki/'
contestantname = 'Gabriel "Gabe" Gagliardi'
contestanturl = contestanturlb + contestantname.replace(" ", "_")
contestant1 = requests.get(contestanturl).content
soup = BeautifulSoup(contestant1, features="lxml")
soup_answers = soup.find_all('div', {'class' :'pi-data-value pi-font'})
soup_titles = soup.find_all('h3', {'class' :'pi-data-label pi-secondary-font'})

Now let’s loop through that data and put it in a data frame, iterating from zero to the number of titles.


titlelen = len(soup_titles)
Name = contestantname
dfsoup = pd.DataFrame([])


for x in range(0, titlelen):
    titlestr = str(soup_titles[x])
    titleout = re.search('>(.*)<', titlestr)
    Title = titleout.group(1)
    answerstr = str(soup_answers[x])
    answerout = re.search('>(.*)<', answerstr)
    Answer = answerout.group(1)
    dftest = pd.DataFrame(columns=list('ABC'))
    dftest.loc[0] = [Name, Title, Answer]
    # DataFrame.append was removed in pandas 2.0; concat is the replacement
    dfsoup = pd.concat([dfsoup, dftest])

    
dfsoup
##                           A  ...                                                  C
## 0  Gabriel "Gabe" Gagliardi  ...                                        Chicago, IL
## 0  Gabriel "Gabe" Gagliardi  ...                                                 27
## 0  Gabriel "Gabe" Gagliardi  ...                                Marketing Executive
## 0  Gabriel "Gabe" Gagliardi  ...                                                  0
## 0  Gabriel "Gabe" Gagliardi  ...                                                  1
## 0  Gabriel "Gabe" Gagliardi  ...                                                  0
## 0  Gabriel "Gabe" Gagliardi  ...                                                  1
## 0  Gabriel "Gabe" Gagliardi  ...                                               10th
## 0  Gabriel "Gabe" Gagliardi  ...  <a href="/wiki/Episode_202_-_11_Chefs" title="...
## 
## [9 rows x 3 columns]

We already have a data frame with all the names, so let’s build a list from that column and create an empty data frame with three columns. We also need to keep unique names only, since some contestants appear in multiple seasons.

namelist = seasondfs['Contestant']
namelist = list(set(namelist))
dfcontestants = pd.DataFrame(columns=list('ABC'))

We know the loop above works through one page, so now we nest it inside another loop that works through all of the contestant pages. This is what it looks like when you put everything together.

for x in namelist:
    contestanturlb = 'https://hellskitchen.fandom.com/wiki/'
    contestantname = x
    contestanturl = contestanturlb + contestantname.replace(" ", "_")
    contestant1 = requests.get(contestanturl).content
    soup = BeautifulSoup(contestant1, features="lxml")
    soup_answers = soup.find_all('div', {'class': 'pi-data-value pi-font'})
    soup_titles = soup.find_all('h3', {'class': 'pi-data-label pi-secondary-font'})
    titlelen = len(soup_titles)
    Name = contestantname
    # use a different loop variable so we don't shadow the outer x
    for i in range(titlelen):
        titlestr = str(soup_titles[i])
        titleout = re.search('>(.*)<', titlestr)
        Title = titleout.group(1)
        answerstr = str(soup_answers[i])
        answerout = re.search('>(.*)<', answerstr)
        Answer = answerout.group(1)
        dftest = pd.DataFrame(columns=list('ABC'))
        dftest.loc[0] = [Name, Title, Answer]
        dfcontestants = pd.concat([dfcontestants, dftest])

While no gender data is listed for the contestants, I was able to pull the most common pronoun in each article; that loop is shown below.

for x in namelist:
    contestanturlb = 'https://hellskitchen.fandom.com/wiki/'
    contestantname = x
    contestanturl = contestanturlb + contestantname.replace(" ", "_")
    contestant1 = requests.get(contestanturl).content
    soup = BeautifulSoup(contestant1, features="lxml")
    Name = contestantname
    soupstr = str(soup)
    # count occurrences of each pronoun set in the page text
    hehis = soupstr.count(" he ") + soupstr.count(" his ")
    sheher = soupstr.count(" she ") + soupstr.count(" her ")
    theytheir = soupstr.count(" they ") + soupstr.count(" their ")
    if hehis > sheher and hehis > theytheir:
        pronoun = 'He'
    elif sheher > hehis and sheher > theytheir:
        pronoun = 'She'
    else:
        pronoun = 'They'
    # one pronoun row per contestant (no inner loop needed here)
    dftest = pd.DataFrame(columns=list('ABC'))
    dftest.loc[0] = [Name, 'Pronouns', pronoun]
    dfcontestants = pd.concat([dfcontestants, dftest])

So now we have all the data we want pulled, but there’s one issue.



We can’t analyze the raw data as-is, so it needs to be cleaned. Basically, this means reshaping the data from long to wide form and making sure that contestants who appeared in multiple seasons get one line per season. I also grouped specific occupations into broader categories.
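As a minimal sketch of the long-to-wide step (using a toy frame shaped like the A/B/C output of the scraping loop above; the names and values here are made up for illustration, and the real cleaning also has to handle repeat contestants, which a plain pivot does not):

```python
import pandas as pd

# Toy long-form data shaped like the scraping output:
# A = contestant name, B = field title, C = value (all made up)
long_df = pd.DataFrame({
    'A': ['Gabe', 'Gabe', 'Gabe', 'Aaron', 'Aaron', 'Aaron'],
    'B': ['Hometown', 'Age', 'Placement'] * 2,
    'C': ['Chicago, IL', '27', '10th', 'Seattle, WA', '28', '9th'],
})

# One row per contestant, one column per field
wide_df = long_df.pivot(index='A', columns='B', values='C').reset_index()
print(wide_df)
```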

head(finaldf2)
##             Contestant Age Challenges.Won Kitchen.Experience Season       Job
## 1         Aaron Lhamon  28              4                 NA     13      Cook
## 2          Aaron Smock  22              3                 NA     16 Education
## 3           Aaron Song  48              0                 NA      3      Chef
## 4           Adam Livow  31              4                 NA     14      Chef
## 5          Alan Parker  42              0                 NA     15      Chef
## 6 Alicia "LA" Limtiaco  22              4                  4      5      Cook
##   Place Location Services.Won Time.Nominated Pronoun pronounnum PlaceGroup
## 1     9       MA            4              3      He          0    9th-5th
## 2    12       MI            1              3      He          0      10th+
## 3    10       CA            1              0      He          0    9th-5th
## 4    10       NJ            4              1      He          0    9th-5th
## 5    12       PA            0              0      He          0      10th+
## 6     8       NV            3              2     She          1    9th-5th

Now we can look at which variables relate to contestants’ placement. Because seasons have a varied number of contestants, 12th place actually means placement 12 or worse.

There doesn’t seem to be any relation between age and final placement. Now let’s see if pronoun relates to final placement. The She pronoun was given a value of 1 and He a value of 0, so we can see the percentage of She contestants at each placement.
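That percentage drops out of a simple groupby on the 0/1 indicator; a sketch with made-up rows, assuming the Place and pronounnum columns from the cleaned frame:

```python
import pandas as pd

# Made-up rows with the cleaned frame's Place / pronounnum columns
# (pronounnum: 1 = She, 0 = He, as described above)
df = pd.DataFrame({
    'Place': [1, 1, 2, 2, 2, 3],
    'pronounnum': [1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 indicator per placement = proportion of She contestants
she_by_place = df.groupby('Place')['pronounnum'].mean()
print(she_by_place)
```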

There is a bit more of a trend here: She contestants seem somewhat more likely to win, despite He and She contestants being nearly equal overall (the proportion of She is 0.4932432).

Let’s see if trends become more clear if we group placements.

The trend toward lower age is still weak, but now that the places are grouped we can add standard deviations for a better look.

Even though the trend isn’t strong, the final four skew more toward She than He.

Let’s see if occupation category changes over placement.

The occupation distribution also seems very consistent across placement groups.

Let’s see how performance relates to placement. We should see a negative correlation between services won and placement.

We do see a negative correlation (the placement axis is reversed) but it seems to even out at around 6th place. Let’s see how times nominated relates to placement.

Here we see a bit of a U shape, which makes sense. The worst performers would only get nominated once and then eliminated, while those in the middle might be nominated repeatedly without being eliminated. Contestants who ended up in the top 3 seem to have been nominated very little, even though there were many occasions when they could have been.

Even though only 39 contestants had Kitchen Experience listed, it would be interesting to see how it relates to placement.

Years of kitchen experience tends to be lower in the higher-placed group, but there’s still a wide range both among those who rank near last and those who rank near first. Interestingly, the middle group had a denser distribution of years of kitchen experience.

Let’s compare the top two contestants in each season and see whether a regression model flags any of the variables discussed above. The top two were chosen because Ramsay often viewed them as very similar, and it also increases our sample size compared to using only the 18 winners. “Job: Other” contestants were removed from the analysis due to small sample size.

  Predictors of Top Two

  Predictors                   Odds Ratios  CI           p
  (Intercept)                  0.35         0.04 – 3.45  0.361
  Age                          0.97         0.90 – 1.03  0.329
  Job [Chef]                   1.11         0.48 – 2.68  0.813
  Job [Cook]                   0.72         0.23 – 2.08  0.546
  Job [Education]              1.22         0.17 – 5.53  0.816
  Job [Private/Personal Chef]  0.68         0.10 – 2.92  0.641
  Pronoun [She]                1.32         0.64 – 2.74  0.452
  Observations                 285
  R2 Tjur                      0.009
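The exact modeling code isn’t shown here, but for a single binary predictor like pronoun, an odds ratio has a direct interpretation: the odds of reaching the top two in one group divided by the odds in the other. This toy sketch (entirely made-up data) shows how a figure like the Pronoun [She] row is constructed:

```python
import pandas as pd

# Made-up data: pronoun vs. whether the contestant reached the top two
df = pd.DataFrame({
    'Pronoun': ['She', 'She', 'She', 'He', 'He', 'He', 'He', 'He'],
    'TopTwo':  [1,     0,     0,     1,    0,    0,    0,    0],
})

# Odds of reaching the top two within each pronoun group
odds = {}
for group, sub in df.groupby('Pronoun'):
    p = sub['TopTwo'].mean()
    odds[group] = p / (1 - p)

# Odds ratio for She vs. He (the quantity a "Pronoun [She]" row reports)
odds_ratio = odds['She'] / odds['He']
print(odds_ratio)
```

A full logistic regression generalizes this by estimating such ratios for all predictors simultaneously.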

As expected, age, job, and pronoun did not relate to placement. None of the demographic variables I was able to pull seem to relate to placement. Overall, winners of Hell’s Kitchen appear to be diverse across a variety of variables.